---
title: Customer Support Quality Assurance
description: Implement a comprehensive QA system for customer support responses using Rating-based LLM evaluation, Exact Match rules, and Manual review
---

<Card
title="Live example"
href="https://app.latitude.so/share/d/3cb9571e-8022-415c-9140-9d729fc0b155"
arrow="true"
cta="Copy to your Latitude">
Try out this agent setup in the Latitude Playground.
</Card>

## Overview

This tutorial demonstrates how to build a quality assurance system for customer support responses using three specific Latitude evaluation types:

- **LLM-as-Judge**: a Rating evaluation that assesses helpfulness
- **Programmatic Rules**: Exact Match and Regular Expression checks that validate required information
- **Human-in-the-Loop**: a manual evaluation that scores customer satisfaction

## The Prompt

This prompt generates the customer support responses. It is deliberately simple: it takes a customer query and a few ticket details as parameters, and it doesn't use a knowledge base or any other external information.

<CodeGroup>
```markdown main
---
provider: OpenAI
model: gpt-4.1
---

You are a helpful customer support agent. Respond to the customer inquiry below with empathy and provide a clear solution.

Customer inquiry: {{customer_message}}
Customer tier: {{tier}}
Product: {{product_name}}

Requirements:
- Always include the ticket number: {{ticket_number}}
- Address the customer by name if provided
- Provide specific next steps
- End with "Is there anything else I can help you with today?"
```
</CodeGroup>
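
To make the parameters concrete, here is a minimal Python sketch of how the `{{...}}` placeholders could be filled in before the prompt is sent to the model. This is not Latitude's renderer, only a plain regex substitution for illustration; the `render` helper, the abbreviated template, and the sample values are all hypothetical.

```python
import re

# Abbreviated excerpt of the prompt above; only the parameterized lines are kept.
TEMPLATE = """Customer inquiry: {{customer_message}}
Customer tier: {{tier}}
Product: {{product_name}}
Always include the ticket number: {{ticket_number}}"""

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, params: dict) -> str:
    """Replace each {{name}} placeholder with its value; fail if one is missing."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return str(params[name])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical sample values, shaped like one row of a dataset.
SAMPLE_PARAMS = {
    "customer_message": "My invoice shows a duplicate charge.",
    "tier": "premium",
    "product_name": "Acme Billing",
    "ticket_number": "TCKT-1234",
}

rendered = render(TEMPLATE, SAMPLE_PARAMS)
print(rendered)
```

Raising on a missing parameter, rather than leaving the placeholder in place, makes an incomplete row fail loudly instead of producing a response with a literal `{{ticket_number}}` in it.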

<Note>
In this example, the prompt is very simple, but you could also upload documents to OpenAI and use the [Responses API file search](https://platform.openai.com/docs/guides/tools-file-search). It implements a knowledge base search that finds relevant information in the documents, so your responses to customer support queries can be based on actual documentation. However, this is outside the scope of this tutorial.
</Note>

## The Evaluations

To create new evaluations, go to the evaluations tab in the Latitude Playground and click on "Add Evaluation".
![Evaluations](/assets/cases/evaluations-button.png)

<AccordionGroup>
<Accordion title="Helpfulness Assessment (LLM-as-Judge)">
    This is how we configure an LLM-as-Judge evaluation to assess the helpfulness of customer support responses.
    <Steps>
      <Step title="Configure the evaluation">
    This evaluation uses the Rating metric: a judge model scores each response on a 1-5 scale against a criterion such as **Assess how well the response follows the given instructions**. On that scale, 1 means **Not faithful, doesn't follow the instructions** and 5 means **Very faithful, follows the instructions**.
       <Expandable title="LLM-as-Judge Evaluation modal image">
       ![LLM-as-Judge Rating Evaluation](/assets/cases/helpfulness-assetment-evaluation.png)
       </Expandable>
      </Step>
      <Step title="Create an experiment from the evaluation">
        An [Experiment](/guides/experiments/overview) runs the prompt many times and uses this evaluation to check whether each run meets the criteria.
        Before creating the experiment, we need to create a dataset. Click on "Generate dataset".

        <Expandable title="Experiment modal image">
            ![Experiment modal](/assets/cases/experiment-modal.png)
        </Expandable>
      </Step>
      <Step title="Create the synthetic dataset">
        A synthetic dataset is generated by the system, so we can test the evaluation without having to collect a real dataset. It has one column for each parameter in our prompt.
        <Expandable title="Generate dataset modal image">
        ![Generate dataset](/assets/cases/generate-synthetic-dataset.png)
        </Expandable>
      </Step>
      <Step title="Run the experiment">
        Once we have the dataset, select it in the dataset selector and click "Run experiment".
        <Expandable title="Select dataset in experiment image">
        ![Run experiment](/assets/cases/select-dataset-in-experiment.png)
        </Expandable>
        <Note>Note that the columns in this dataset must match the parameters in our prompt.</Note>
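
The column requirement can be sketched as a simple set comparison. This is an illustrative Python check, not something Latitude exposes; the `check_columns` helper is hypothetical, and the parameter names are the ones used in the prompt above.

```python
# Parameters used by the prompt in this tutorial.
PROMPT_PARAMETERS = {"customer_message", "tier", "product_name", "ticket_number"}


def check_columns(columns: list) -> dict:
    """Report parameters that have no column and columns that have no parameter."""
    present = set(columns)
    return {
        "missing": PROMPT_PARAMETERS - present,
        "unused": present - PROMPT_PARAMETERS,
    }


# A dataset whose ticket column is named "ticket" instead of "ticket_number".
report = check_columns(["customer_message", "tier", "product_name", "ticket"])
print(report["missing"])  # parameters the dataset cannot fill
print(report["unused"])   # columns the prompt never reads
```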
      </Step>
      <Step title="View experiment results">
        After running the experiment with 30 rows of the synthetic dataset you just created, you can see the results! The green counter shows the successful cases. Yellow represents results that failed the evaluation, and red means errors occurred during the experiment run.
        <Expandable title="Experiment results image">
        ![Experiment results](/assets/cases/experiment-results.png)
        </Expandable>
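
The three counters can be thought of as a tally over the experiment rows. The sketch below assumes a hypothetical row shape (`passed` and `error` keys) chosen only for illustration; it is not the format Latitude uses internally.

```python
from collections import Counter


def summarize(results: list) -> Counter:
    """Count rows per status: 'passed' (green), 'failed' (yellow), 'error' (red)."""
    counts = Counter()
    for row in results:
        if row.get("error") is not None:
            # The run itself broke, so there is no evaluation result to judge.
            counts["error"] += 1
        elif row["passed"]:
            counts["passed"] += 1
        else:
            counts["failed"] += 1
    return counts


# Hypothetical results for four rows of the dataset.
rows = [
    {"passed": True, "error": None},
    {"passed": True, "error": None},
    {"passed": False, "error": None},
    {"passed": False, "error": "provider timeout"},
]
summary = summarize(rows)
print(summary)
```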
      </Step>
    </Steps>

</Accordion>
<Accordion title="Required Information Validation (Programmatic Rule - Exact Match)">
The goal of this evaluation is to ensure every response contains mandatory elements like ticket numbers and proper closing statements. Let's set it up.
<Steps>
<Step title="Configure the evaluation">
<Note>This rule cannot be used with your real logs. It needs an **expected output** to match.</Note>
<Expandable title="Programmatic rule Evaluation modal image">
![Programmatic rule with exact match](/assets/cases/programmatic-rule.png)
</Expandable>
</Step>
<Step title="Create dataset with expected output">
We need to create another dataset, but this time it must have an **expected output** column.
You can use the same dataset but add a new column with the expected output. In this case, we want to ensure our prompt always responds with the sentence **Is there anything else I can help you with today?**
<Expandable title="Dataset with expected output selector image">
![Dataset with expected output](/assets/cases/dataset-expected-output.png)
</Expandable>
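
As a rough mental model, an exact match rule can be sketched as a string comparison between the response and the expected output. The sketch assumes the comparison covers the whole response and ignores surrounding whitespace; both points are assumptions made for illustration, and the `exact_match` helper is hypothetical.

```python
def exact_match(actual: str, expected: str) -> bool:
    """Pass only when the response equals the expected output.

    Leading and trailing whitespace is ignored (an assumption for this sketch).
    """
    return actual.strip() == expected.strip()


EXPECTED = "Is there anything else I can help you with today?"

same = exact_match("Is there anything else I can help you with today?\n", EXPECTED)
different = exact_match("Is there anything else I can do for you?", EXPECTED)
print(same, different)
```

Under this assumption, any difference in wording fails the check, which is why the rule depends on a dataset column that holds the expected output.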
</Step>
</Steps>
</Accordion>
<Accordion title="Contains Ticket Number (Programmatic Rule - Regular Expression)">
<Steps>
<Step title="Configure the evaluation">
To configure this evaluation, we use a regular expression that checks whether the customer support response contains a ticket number.
In this case, we require:

1. The ticket number starts with `TCKT-`
2. The prefix is followed by 4 digits (`\d{4}`)

Together, the pattern is `TCKT-\d{4}`, which matches the shape of the ticket column in our dataset.
<Expandable title="Dataset ticket column image">
![Dataset ticket format](/assets/cases/dataset-ticket-format.png)
</Expandable>
Now we're ready to create this new evaluation.
<Expandable title="Regular expression Evaluation modal image">
![Regular expression modal](/assets/cases/regular-expression-modal.png)
</Expandable>
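
To see what this kind of rule checks, here is a small Python sketch using the standard `re` module. It assumes the ticket format described above (the `TCKT-` prefix followed by four digits); the `contains_ticket_number` helper is hypothetical and only mirrors the idea of the rule.

```python
import re

# "TCKT-" followed by exactly four digits, found anywhere in the response.
TICKET_PATTERN = re.compile(r"TCKT-\d{4}")


def contains_ticket_number(response: str) -> bool:
    """Return True when the response mentions a ticket such as TCKT-0042."""
    return TICKET_PATTERN.search(response) is not None


with_ticket = contains_ticket_number("Your ticket TCKT-0042 is now open.")
too_short = contains_ticket_number("Your ticket TCKT-42 is now open.")
no_prefix = contains_ticket_number("Your ticket 0042 is now open.")
print(with_ticket, too_short, no_prefix)
```

The pattern searches anywhere in the text, so it also accepts a longer number that starts with four digits. If that matters for your ticket format, a stricter pattern such as `\bTCKT-\d{4}\b` is one option.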
</Step>
<Step title="Run the experiment">
This step is the same as for the first evaluation: we create an experiment and look at the results. In this case, we should see that the AI responded with the ticket number because it's part of our prompt. This is a basic check, but it ensures that future modifications to the prompt keep the ticket number.
</Step>
</Steps>
</Accordion>
<Accordion title="Manual Evaluation (HITL - Human in the Loop)">
<Steps>
<Step title="Configure the evaluation">
Customer satisfaction involves nuanced judgment about tone, cultural sensitivity, and domain-specific accuracy that automated systems might miss, making it perfect for human evaluation.
<Expandable title="Manual evaluation modal image">
![HITL Configuration](/assets/cases/human-in-the-loop-configuration.png)
</Expandable>
</Step>
<Step title="Annotate past conversations (logs)">
The first way to let human evaluators review responses is to give them access to Latitude's logs.
Now that we've configured the HITL evaluation, when they click on a log they can assign a score from 1 to 5 in the right panel, as previously configured.
<Expandable title="Manual evaluation on latitude logs">
![HITL from Latitude Logs](/assets/cases/manual-evaluation-from-logs.png)
</Expandable>
</Step>
<Step title="Annotate with the SDK">
Another way to add manual evaluations is to use the Latitude SDK. You
can see an example of [how to do it here](/examples/sdk/annotate-log).
</Step>
<Step title="Minimum score">
One thing we didn't do when configuring the evaluation was set a minimum score required to pass. Let's do it now: go to the manual evaluation's detail and click **Settings** at the top right of the screen.
<Expandable title="Min score configuration">
![HITL min score configuration](/assets/cases/hitl-setting-min-score-threshold.png)
</Expandable>
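
The effect of the threshold can be sketched in a few lines of Python. The sketch uses the 1 to 5 scale configured earlier and a minimum score of `3`; treating the threshold as inclusive is an assumption made for illustration, and the `passes` helper is hypothetical.

```python
MIN_SCORE = 3  # the minimum score used in this tutorial


def passes(score: int, min_score: int = MIN_SCORE) -> bool:
    """Return True when a 1-5 score reaches the minimum (assumed inclusive)."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return score >= min_score


verdicts = {score: passes(score) for score in range(1, 6)}
print(verdicts)
```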
</Step>
<Step title="Manual evaluation results">
Now that our human evaluator has scored the responses, we can see the results in the experiment.
In the image, one evaluation with a score of `1` is shown in green because it was scored before we set the minimum score to `3`. The next one didn't pass and is shown in red.
<Expandable title="Manual evaluation results">
![HITL results table](/assets/cases/hitl-results-table.png)
</Expandable>
</Step>
</Steps>
</Accordion>
</AccordionGroup>

## Live Mode

We've done a lot of work so far: we set up four evaluations, but we only tested them against synthetic data. Now we want to run our evaluations against real customer interactions. This is what we call **Live Mode**.

Let's set the **Helpfulness Assessment** evaluation to live mode. Go to the evaluation's detail, click **Settings** in the top right corner, and at the bottom, under **Advanced configuration**, turn on the **Evaluate live logs** toggle.
![Live logs configuration](/assets/cases/live-logs-toggle.png)

We do the same for the **Contains Ticket Number** programmatic rule evaluation.

<Note>**Manual** evaluations can't be set to live mode because human evaluators review the responses manually after the AI responds to the customer. The **Required Information Validation** evaluation is also not suitable because it requires an expected output to match against the AI response.</Note>

![Evaluation list](/assets/cases/evaluation-list.png)
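
The reasoning above boils down to two conditions: an evaluation can run on live logs only if it needs neither an expected output nor a human reviewer. The Python sketch below encodes that rule for the four evaluations in this tutorial; the `Evaluation` class and its field names are hypothetical and exist only to illustrate the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    name: str
    needs_expected_output: bool  # compares against a dataset column
    needs_human: bool            # scored manually by a reviewer


def can_run_live(evaluation: Evaluation) -> bool:
    """Live logs carry no expected output and are not reviewed in real time."""
    return not (evaluation.needs_expected_output or evaluation.needs_human)


EVALUATIONS = [
    Evaluation("Helpfulness Assessment", needs_expected_output=False, needs_human=False),
    Evaluation("Required Information Validation", needs_expected_output=True, needs_human=False),
    Evaluation("Contains Ticket Number", needs_expected_output=False, needs_human=False),
    Evaluation("Manual Evaluation", needs_expected_output=False, needs_human=True),
]

live = [e.name for e in EVALUATIONS if can_run_live(e)]
print(live)
```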

## Conclusion

By setting up an evaluation framework for customer support responses, we've seen how automated and manual evaluations work together. LLM-based ratings assess response helpfulness at scale, while programmatic rules, such as exact match and regular expressions, ensure that critical information like ticket numbers and required statements is always included. Human-in-the-loop (manual) evaluations provide the nuanced judgment that automated checks can miss, especially for customer satisfaction and tone.

Testing with both synthetic data and real data (live mode) gives us confidence that the evaluations are reliable. Together, they help us catch issues early, improve our prompts, and consistently deliver accurate, customer-friendly support.

## Resources

- [LLM-as-Judge Evaluation](/guides/evaluations/llm-as-judges) — How to use LLMs to evaluate responses
- [Programmatic Rule Evaluation](/guides/evaluations/programmatic-rules) — How to use programmatic rules to evaluate responses
- [Human-in-the-Loop Evaluation](/guides/evaluations/humans-in-the-loop) — How to use human evaluators to evaluate responses
- [Running Evaluations](/guides/evaluations/running-evaluations) — How to run evaluations against synthetic and live data
- [Datasets](/guides/datasets/overview) — How to create datasets for evaluations
